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Abstract 

We describe a framework and equations used to model and predict the 
behavior of multi-agent systems (MASs) with learning agents. A difference 
equation is used for calculating the progression of an agent's error in its 
decision function, thereby telling us how the agent is expected to fare in 
the MAS. The equation relies on parameters which capture the agent's 
learning abilities, such as its change rate, learning rate and retention rate, 
as well as relevant aspects of the MAS such as the impact that agents 
have on each other. We validate the framework with experimental results 
using reinforcement learning agents in a market system, as well as with 
other experimental results gathered from the AI literature. Finally, we 
use PAC-theory to show how to calculate bounds on the values of the 
learning parameters. 

Multi- Agent Systems, Machine Learning, Complex Systems. 

1 Introduction 

With the steady increase in the number of multi-agent systems (MASs) with 
learning agents 0, 0, Hfl l3l| the analysis of these systems is becoming increas- 
ingly important. Some of the research in this area consists of experiments where 
a number of learning agents are placed in a MAS, then different learning or sys- 
tem parameters are varied and the results are gathered and analyzed in an 
effort to determine how changes in the individual agent behaviors will affect the 
system behavior. We have learned about the dynamics of market-based MASs 
using this approach |34j . However, in this article we will take a step beyond 
these observation-based experimental results and describe a framework that can 
be used to model and predict the behavior of MASs with learning agents. We 
give a difference equation that can be used to calculate the progression of an 
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Figure 1: The agents in a MAS. 



agent's error in its decision function. The equation relies on the values of pa- 
rameters which capture the agents' learning abilities and the relevant aspects 
of the MAS. We validate the framework by comparing its predictions with our 
own experimental results and with experimental results gathered from the AI 
literature. Finally, we show how to use probably approximately correct (PAC) 
theory to get bounds on the values of some of the parameters. 

The types of MAS we study are exemplified by the abstract representation 
shown in Figure ^ We assume that the agents observe the physical state of 
the world (denoted by w in the figure) and take some action (a) based on their 
observation of the world state. An agent's mapping from states to actions is 
denoted by the decision function (<5) inside the agent. Notice that the "physical" 
state of the world includes everything that is directly observable by the agent 
using its sensors. It could include facts such as a robot's position, or the set 
outstanding bids in an auction, depending on the domain. The agent does not 
know the decision functions of the other agents. After taking action an agent 
can change its decision function as prescribed by whatever machine-learning 
algorithm the agent is using. 

We have a situation where agents are changing their decision function based 
on the effectiveness of their actions. However, the effectiveness of their actions 
depends on the other agents' decision functions. This scenario leads to the 
immediate problem: if all the agents are changing their decisions functions then 
it is not clear what will happen to the system as a whole. Will the system 
settle to some sort of equilibrium where all agents stop changing their decision 
functions? How long will it take to converge? How do the agents' learning 
abilities influence the system's behavior and possible convergence? These are 
some of the questions we address in this article. 

Section [3 presents our framework for describing an agent's learning abilities 
and the error in its behavior. Section [5] presents an equation that can be used 
to predict an agent's expected error, as a function of time, when the agent is 
in a MAS composed of other learning agents. This equation is simplified in 
Section 01 by making some assumptions about the type of MAS being modeled. 
Section0then defines the last few parameters used in our framework — volatility 
and impact. Section Ogives an illustrative example of the use of the framework. 
The predictions made by our framework are verified by our own experiments, as 
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shown in Section [7| and with the experiments of others, as shown in Section |SJ 
The use of PAC theory for determining bounds on the learning parameters is 
detailed in Section [5] Finally Sectional describes some of the related work and 
Section HI summarizes our claims. 

2 A Framework for Modeling MASs 

In order to analyze the behaviors of agents in MASs composed of learning 
agents, we must first construct a formal framework for describing these agents. 
The framework must state any assumptions and simplifications it makes about 
the world, the agents, and the agents' behaviors. It must also be mathemati- 
cally precise, so as to allow us to make quantitative predictions about expected 
behaviors. Finally, the simplifications brought about because of the need for 
mathematical precision should not be so constraining that they prevent the ap- 
plicability of the framework to a wide variety of learning algorithms and different 
types of MASs. We now describe our framework and explain the types of MASs 
and learning behaviors that it can capture. 

2.1 The World and its Agents 

A MAS consists of a finite number of agents, actions, and world states. We 
let N denote the finite set of agents in the system. W denotes the finite set of 
world states. Each agent is assumed to have a set of perceptors (e.g., a camera, 
microphone, bid queue) with which it can perceive the world. An agent uses is 
sensors to "look" at the world and determine which world state w it is in; the 
set of all these states is W. Ai, where \Ai\ > 2, denotes the finite set of actions 
agent i € N can take. 

We assume discrete time, indexed in the various functions by the superscript 
t, where t is an integer greater than or equal to 0. The assumption of discrete 
time is made, for practical reasons, by a number of learning algorithms. It 
means that, while the world might be continuous, the agents perceive and learn 
in separate discrete time steps. 

We also assume that there is only one world state w at each time, which all 
the agents can perceive in its completeness. That is, we asume the enviroment 
is accessible (as defined in [2(| p46]). This assumption holds for market systems 
in which all the actions of all the agents are perceived by all the agents, and 
for software agent domains in which all the agents have access to the same 
information. However, it might not hold for robotic domains where one agent's 
view of the world might be obscured by some physical obstacle. Even in such 
domains, it is possible that there is a strong correlation between the states 
perceived by each agent. These correlations could be used to create equivalency 
classes over the agents' perceived states, and these classes could then be used 
as the states in W. 

Finally, we assume the environment is determistic [2(| p46]. That is, the 
agents' combined actions will always have the expected effect. Of course, agent 
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Figure 2: Action/Learn loop for an agent. 



i might not know what action agent j will take so i might not know the eventual 
effect of its own individual action. 

2.2 A Description of Agent Behavior 

In the types of MASs we are modeling, every agent i perceives the state of the 
world w and takes an action a*, at each time step. We assume that every agent's 
behavior, at each moment in time, can be described with a simple state-to- 
action mapping. That is, an agent's choice of action is solely determined by its 
current state-to-action mapping and the current world w. 

Formally, we say that agent i's behavior is represented by a decision func- 
tion (also known as a "policy" in control theory and a "strategy" in game 
theory), given by 8\ : W — > Aj. This function maps each state w G W to the 
action ai G A4 that agent i will take in that state, at time t. This function can 
effectively describe any agent that dcterministically chooses its action based on 
the state of the world. Notice that the decision function is indexed with the 
time t. This allows us to represent agents that change their behavior. 

The action agent i should take in each state w is given by the target func- 
tion A* : W — ► Ai, which also maps each state w G W to an action a, £ 4{. 
The agent does not have direct access to its target function. The target function 
is used to determine how well an agent is doing. That is, it represents the "per- 
fect" behavior for a given agent. An agent's learning task is to get its decision 
function to match its target function as much as possible. 

Since the choice of action for agent i often depends on the actions of other 
agents, the target function for i needs to take these actions into account. That 
is, in order to generate A*, one would need to know Sj(w) for all j G 2V_j and 
w G W. These 5j(w) functions tell us the actions that all the other agents will 
take in every state w. For example, in order for one to determine what an agent 
should bid in every world w of an auction-based market system, one will need 
to know what the other agents will bid in every world w. One can use these 
actions, along with the state w, in order to identify the best action for i to take. 

An agent's Sj(w) can change over time, so that <5- +1 ^ 8\. These changes 
in an agent's decision function reflect its learned knowledge. The agents in 
the MASs we consider are engaged in the discrete action/learn loop shown in 
Figure El The loop works as follows: At time t the agents perceive a world 
w l G W which is drawn from a fixed distribution T>(w). They then each take 
the action dictated by their S\ functions; all of these actions are assumed to 
be taken effectively in parallel. Lastly, they each receive a payoff which their 
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respective learning algorithms use to change the Sj so as to, hopefully, better 
match A*. By time t + 1, the agents have new (5* +1 functions and are ready to 
perceive the world again and repeat the loop. Notice that, at time t, an agent's 
A* is derived by taking into account the <5* of all other agents j <E iV_j. 

We assume that V(w) is a fixed probability distribution from which we take 
the worlds seen at each time. This assumption is not unreasonably limiting. 
For example, in an economic domain where the new state is the new good being 
offered, or in an episodic domain where the agents repeatedly engage in different 
games (e.g. a Prisoner's Dilemma competition) there is no correlation between 
successive world states or between these states and the agents' previous actions. 
However, in a robotic domain one could argue that the new state of the world 
will depend on the current state of the world; after all, the agents probably 
move very little each time step. 

Our measure of the correctness of an agent's behavior is given by our error 
measure. We define the error of agent j's decision function as 

e(Sl) = £ V(w)Pr[Sl(w) + A\(w)] 

wEW (1) 

= Pr m6D [aJM^A i *M]. 

e(<5*) gives us the probability that agent i will take an incorrect action; it 
is in keeping with the error definition used in computational learning theory 
|l7j . We use it to gauge how well agent i is performing. An error of means 
that the agent is taking all the actions dictated by its target function. An 
error of 1 means that the agent never takes an action as dictated by its target 
function. Each action the agent takes is either correct or incorrect, that is, it 
either matches the target function or it docs not. We do not model degrees of 
incorrectness. However, since the error is defined as the average over all possible 
world states, an agent that takes the correct action in most world states will 
have a small error. Extending the theory to handle degrees of incorrectness is 
one of the subjects of your continuing work, see Section ^2 All the notation 
from this section is summarized in Figure |3J 

2.3 The Moving Target Function Problem 

The learning problem the agent faces is to change its so that it matches 

A|(w). If we imagine the space of all possible decision functions, then agent i's 
8\ and A* will be two points in this space, as shown in Figure 0] The agent's 
learning problem can then be re-stated as the problem of moving its decision 
function as close as possible to its target function, where the distance between 
the two functions is given by the error e{8\). This is the traditional machine 
learning problem. 

However, once agents start to change their decision functions (i.e., change 
their behaviors) the problem of learning becomes more complicated because 
these changes might cause changes in the other agents' target functions. We 
end up with a moving target function, as seen in Figure [5J In these systems, it 
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TV the set of all agents, where i £ N is one particular agent. 

W the set of possible states of the world, where w £ W is one particular 
state. 

Ai the set of all actions that agent i can take. 

8\ : W — > Ai the decision function for agent i at time t. It tells which 
action agent i will take in each world. 

A* : W — > Ai the target function for agent i at time t. It tells us what 
action agent i should take. It takes into account the actions that 
other agents will take. 

e(5\) — Pr[Sj(w) ^ A'(iu) \w £ V] the error of agent i at time t. It is the 
probability that i will take an incorrect action, given that the worlds 
w are taken from the fixed probability distribution V. 



Figure 3: Summary of notation used for describing a MAS and the agents in it. 
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is not clear if the error will ever reach or, more generally, what the expected 
error will be as time goes to infinity. Determining what will happen to an agent's 
error in such a system is what we call the moving target function problem, 
which we address in this article. However, we will first need to define some 
parameters that describe the capabilities of an agent's learning algorithm. 




Figure 5: The learning problem in learning MASs. 



2.4 A Model of Learning Algorithms 

An agent's learning algorithm is responsible for changing S\ into 5\ +1 so that it 
is a better match of A* . Different machine learning algorithms will achieve this 
match with different degrees of success. We have found a set of parameters that 
can be used to model the effects of a wide range of learning algorithms. The 
parameter are: Change rate, Learning rate, Retention rate, and Impact; and 
they will be explained in this section, except for Impact which will be introduced 
in Section [5j These parameters, along with the equations we provide, form the 
CLRI framework (the letters correspond to the first letter of the parameters' 
names). 

After agent i takes an action and receives some payoff, it activates its learning 
algorithm, as we showed in Figure [21 The learning algorithm is responsible for 
using this payoff in order to change 8\ into <5' +1 , making <5* +1 match A* as much 
as possible. We can expect that for some w it was true that 5j(w) = A*(u>), 
while for some other w this was not the case. That is, some of the w — > a.; 
mappings given by Sj(w) might have been incorrect. In general, a learning 
algorithm might affect both the correct and incorrect mappings. We will treat 
these two cases separately. 

We start by considering the incorrect mappings and define the change rate 
of the agent as the probability that the agent will change at least one of its 
incorrect mappings. Formally, we define the change rate Cj for agent i as 

y w Pr[^ + » ? Sl(w) I St(w) A'H] = <H. (2) 

The change rate tells us the likelihood of the agent changing an incorrect 
mapping into something else. This "something else" might be the correct action, 
but it could also be another incorrect action. The probability that the agent 
changes an incorrect mapping to the correct action is called the learning rate 
of the agent. It is defined as k where 

V„ Pr$ +1 H = A-(w) | SUw) ? A'H] - k. (3) 
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There are two constraints which must always be satisfied by these two rates. 
Since changing to the correct mapping implies that a change was made, the 
value of must be less than or equal to c,-, that is, hi < c, must always be true. 
Also, if \Ai\ — 2 then a — h since there are only two actions available, so the 
one that is not wrong must be right. 

The complementary value for the learning rate is 1 — li and refers to the 
probability that an incorrect mapping does not get changed to a correct one. An 
example learning rate of lj = .5 means that if agent i initially has all mappings 
wrong it will make half of them match the original target function after the first 
iteration. 

We now consider the agent's correct mappings and define the retention 
rate as the probability that a correct mapping will stay correct in the next 
iteration. The retention rate is given by 7"$ where 

\/ w Pr$+» = A\(w) I Sl(w) = Ai(w)}. = n. (4) 

We propose that the behavior of a wide variety of learning algorithms can be 
captured (or at least approximated) using appropriate values for a, U, and 
7"j. Notice, however, that these three rates claim that the w — ► a mappings 
that change are independent of the w that was just seen. We can justify this 
independence by noting that most learning algorithms usually perform some 
form of generalization. That is, after observing one world state w and the 
payoff associated with it, a typical learning algorithm is able to generalize what 
it learned to some other world states. This generalization is reflected in the 
fact that the change, learning, and retention rates apply to all w's. However, a 
more precise model would capture the fact that, in some learning algorithms, 
the mapping for the world state that was just seen is more likely to change than 
the mapping for any other world state. 

The rates are not time dependent because we assume that agents use one 
learning algorithm during their lifetimes. The rates capture the capabilities of 
this learning algorithm and, therefore, do not need to vary over time. 

Finally, we define volatility to mean the probability that the target function 
will change from time t to time t + 1. Formally, volatility is given by v t where 

V w Pr[A* +1 (w)^A t i (w)]=v i (5) 

In Section [S] we will show how to calculate Vi in terms of the error of the other 
agents. We will then see that volatility is not a constant but, instead, varies 
with time. 

3 Calculating the Agent's Error 

We now wish to write a difference equation that will let us calculate the agent's 
expected error, as defined in Eq. Q), at time t+1 given the error at time t and the 
other parameters we have introduced. We can do this by observing that there 
are two conditions that determine the new error: whether A* +1 (u>) = A*(w) or 
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not, and whether <5f(w) = A*(u>) or not. If wc define a = A' +1 (u>) = A*(u>), 
and b = Sl(w) — A*(u;), we can then say that we need to consider the four cases 
where: a A b, a A -b, ->a A 6, and ->a A Formally, this implies that 

Pr[Sl +1 (w) ^ Al +1 {w)} = 

Pr[c5* +1 H ¥= A* +1 (w) A a A b] + Pr[<5? +1 (tu) ^ A* +1 (u;) A a A -.&]+ (6) 
Pr$ +1 (w) ^ A*+V) A -a A 6] + Pr[(5* +1 H ^ A* +1 (w) A -a A -6], 

since the four cases are exclusive of each other. Applying the chain rule of 
probability, we can rewrite each of the four terms in order to get 



Pr[(5* +1 H + Al +1 (w)} = Pr[a A b] • Pr$ +1 (w) + \a/\b} + 

Pr[a A -nb] ■ Pr$ +1 (u;) ^ A* +1 (w) | a A -.&] + 
Prha A 6] • Pr[^ +1 (w) ^ A* + V) | -n<z A 6] + 
Pr[-.aA-.t] -Pr$ +1 (w) ^ A- +1 (w) | ->a A -ib]. 



(7) 



We can now find values for these conditional probabilities. We start with the 
first term where, after replacing the values of a and 6, we find that 

Pr[^ t+1 H ± A>+» | A'+V) = A-O) A 5\(w) = A*(w)] = 1 - n. (8) 

Since the target function does not change from time t to t + 1 and the agent was 
correct at time t, the agent will also be correct at time t + 1 ; unless it changes 
its correct w — > a mapping. The agent changes this mapping with probability 
1 

The value for the second conditional probability is 

Pr[<5* +1 M + A*+» | A< +1 H = A'H A «$?(«;) # A-(w)] = 1 - fc. (9) 

In this case the target function still stays the same but the agent was incorrect. 
If the agent was incorrect then it will change its decision function to match 
the target function with probability U. Therefore, the probability that it will 
be incorrect next time is the probability that it does not make this change, or 
l-k- 

The third probability has a value of 

Pr$+» ? A*+» | A* +1 H ± A*(w) A «5|(«,) = A\(w)] 

= (n + (i- n )-B) 1 ' 

In this case the agent was correct and the target function changes. This means 
that if the agent retains the same mapping, which it does with probability r^, 
then the agent will definitely be incorrect at time t+l. If it does not retain the 
same mapping, which happens with probability 1 — r*, then it will be incorrect 
with probability B, where 

B = Pr[5l +1 (w) + A^+VMH = A'(w) A A* +1 (w) + A-(w) 
AS*+V)^A*M]. 
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Finally, the fourth conditional probability has a value of 

Pr$ +1 H + A<+» | A* +1 (w) ± A\(w) A **(«;) + A\{w)\ 
= (1 - Ci)I> + h + (c< - h)F, 

where 

D = Pr[iJH + A* +1 HI^H # AJW A A<+» ± A\{w)] (13) 
F = Pr$ +1 H # Af»|<5» # AJ(«;) A A* +1 ( W ) ^ A'H (14) 
A5l + \w) ^ Al(w) ASl +1 (w) ^ 5t(w)}. 

This is the case where the target function changes and the agent was wrong. 
We have to consider three possibilities. The first possibility is for the agent 
not to change its decision function, which happens with probability 1 — Cj. The 
probability that the agent will be incorrect in this case is given by D. The 
second possibility, when the agent changes its mapping to the correct function, 
has a probability of U and ensures that the agent will be incorrect the next time. 
The third possibility happens, with probability Cj — h when the agent changes 
its mapping to an incorrect value. In this case, the probability that it will be 
wrong next time is given by F. 

We can substitute Eqs. ©, ©, lfTU|) . and l(T2*jl into Eq. J7J), substitute the 
values of a and b, and expand Pr[a A b] into Pr[a | b] ■ Pr[6], in order to get 

E[e(Sl +1 )}=E[J2 V(w)Pr[6i +1 (w)^A^\w)}} = £ V(w)( 

Pr[A< + » = Al(w)\Sl(w) = A*H] • Pr$(u>) = A*H] • (1 
+ Pr[A^» = A|H|^H ^ Al(w)] • Pr$H ^ A*( w )] • (1 
+ Pr[Al +1 (w) ^ Al{w)\Sl(w) = Al(w)] 

■Pr[St(w) = A t i (w)]-(r i + (l-r i )-B) 
+ Pr[Al +1 (w) ± At(w)\5t(w) ± A\(w)\ ■ Pr[5*H ± A\(w)\ 

■{l-c i )D + l l + {c l -l l )F. 

Equation l|15|l will model any MAS whose agent learning can be described with 
the parameters presented Section l2~4l and whose action/learn loop is the same 
as we have described. We can use Eq. (|15|) to calculate the successive expected 
errors for agent i, given values for all the parameters and probabilities. In the 
next section we show how this is done in a simple example game. 

3.1 The Matching game 

In this matching game we have two agents i and j each of whom, in every 
world w, wants to play the same action as the other one. Their set of actions is 
Ai = Aj, where we assume \ Ai\ > 2 (for \Ai\ =2 the equation is simpler). After 
every time step, the agents both learn and change their decision functions in 



k) 



(15) 
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accordance to their learning rates, retention rates, and change rates. Since the 
agents are trying to match each other, in this game it is always true that A ■ (w) = 
Sj(w) and A*(iu) = Sf(w). Given all this information, we can find values for 
some of the probabilities in Eq. (|15fl (including values for Equations i|13fl 
dJO an d rewrite (see Appendix ^] for derivation) it as: 

E[e(St +1 )} = £ V(w){r 3 ■ Vv[5\{w) = A\{w)] • (1 - n) 
+ (l~c J )-Pr[St(w)^Al(w)]-(l-l l ) 



(l-r j ).Pr[Sl(w) = A t i ( W )]-(r i + (l-r i ) 



\Ai\-2\\ (16) 



We can better understand this equation by plugging in some values and sim- 
plifying. For example, lets assume that r\ — r 3 ■ — 1 and ^ = lj = 1, which 
implies that Cj = Cj = 1. This is the case where the two agents always change 
all their incorrect mappings so as to match their respective target functions at 
time t. That is, if we had Sl(wi) — x and 5*(wi) = y, then at time f+1 we will 
have §\ + (wx) = y and S* +1 (wi) = x. This means that agent i changes all its 
incorrect mappings to match j, while j changes to match i, so all the mappings 
stay wrong after all (i.e., i ends up doing what j did before, while j does what 
i did before). The error, therefore, stays the same. We can see this by plugging 
the values into Eq. The first three terms will become and the fourth 

term will simplify to the definition of error, as given by Eq. Q . Since the fourth 
term is the only one that is non-zero, we end up with ^[e^*" 1 " 1 )] = e(<5*). 

We can also let Ci and U (keeping Cj = lj = 1) be arbitrary numbers, which 
gives us E[e(Sl +1 )] = Cie{8\). This tells us that the error will drop faster for 
a smaller change rate Cj. The reason is that i's learning (remember < Cj) in 
this game is counter-productive because it is always made invalid by j'a learning 
rate of 1. That is, since j is changing all its mappings to match i's actions, i's 
best strategy is to keep its actions the same (i.e., Cj = 0). 



4 Further Simplification 

We can further simplify Eq. (|15|l if we are willing to make two assumptions. 
The first assumption is that the new actions chosen when either 5\{w) changes 
(and does not match the target), or when A\(w) changes, are both taken from 
flat probability distributions over A;. By making this assumption we can find 
values for B, D, and F, namely: 

MM-i |A|-2 ; 

The second assumption we make is that the probability of A\ (w) changing, 
for a particular w, is independent of the probability that Sj(w) was correct. 
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In Section 13.11 we saw that in the matching game the probabilities of A* (w) 
and Sj(w) changing were correlated since, if 5\{w) was wrong then 5j(w) was 
also wrong, which meant j would probably change S^w), which would change 
A*(w). 

However, the matching game is a degenerate example in exhibiting such tight 
coupling between the agents' target functions. In general, we can expect that 
there will be a number of MASs where the probability that any two agents i and 
j are correct is uncorrelated (or loosely correlated). For example, in a market 
system all sellers try to bid what the buyer wants, so the fact that one seller 
bids the correct amount says nothing about another seller's bid. Their bids are 
all uncorrelated. In fact, the Distributed Artificial Intelligence literature is full 
of s yste ms that try to make the agents' decisions as loosely-coupled as possible 

CUEl 

This second assumption we are trying to make can be formally represented 
by having Eq. I|18|) be true for all pairs of agents i and j in the system. 

Pr[6l(w) = A\(w) A 8]{w) = A<»] 

= Pr[5$(w) = A*H] • Pr$(u;) = A} (10)] 1 ' 

Once we make these two assumptions we can rewrite Eq. <|15[) as: 
E[e(St +1 )} = £ 2?H(Pr[A^ +1 H = A'H] • (Pr$H = AfH] • (1 - n) 

+ Vv[5l{w)^A\{w)]-(l-k)) 
+ Pr[A^+ » * A\{w)] • (Pr[^M = A\{w)] ■ L + (1 - n) ■ (^j-f ) 

+ Pr[6j(w) ? A-(w)] • ( lAil ]l^[ +2k ))) 

(19) 

Some of the probabilities in this equation are just the definition of w,;, and others 
simplify to the agent's error. This means that we can simplify Eq. (|19H to: 



E[e(Sr l )]^l-r t +' 



\Ai\n - 1 
1^1 - 1 

e(5t) (n - k + Vl ( ^rff^r ~ ' ' (20) 



Eq. (|20|l is a difference equation that can be used to determine the expected 
error of the agent at any time by simply using i?[e(<5* +1 )] as the e{5\) for the 
next iteration. While it might look complicated, it is just the function for a line 
y = mx + b where x — e(5*) and y — e(£* +1 ). Using this observation, and the 
fact that e{5\ +1 ) will always be between and 1, we can determine that the 
final convergence point for the error is the point where Eq. I|20|) intersects the 
line y = x. The only exception is if the slope equals —1, in which case we will 
see the error oscillating between two points. 
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Figure 6: Error progression for agent i, assuming a fixed volatility Vi = .2, 
Ci = 1, l{ = .3, r, = 1, \Ai\ = 20. We show the error function (e(5' +1 )), as well 
as its two components: learning and volatility. The line y — x allows us trace 
the agent's error as it starts at .95 and converges to .44. 



By looking at Eq. JU we can also determine that there are two "forces" 
acting on the agent's error: volatility and the agent's learning abilities. The 
volatility tends to increase the agent's error past its current value while the 
learning reduces it. We can better appreciate this effect by separating the Vi 
terms in Eq. H20(l and plotting the Vi terms (volatility) and the rest of the terms 
(learning) as two separate lines. By definition, these will add up to the line 
given by Eq. (|2*0"|l . We have plotted these three lines and traced a sample error 
progression in Figure|Hl The error starts at .95 and then decreases to eventually 
converge to .44. We notice the learning curve always tries to reduce the agent's 
error, as confirmed by the fact that its line always falls below y = x. Meanwhile, 
the volatility adds an extra error. This extra error is bigger when the agent's 
error is small since, any change in the target function is then likely to increase 
the agent's error. 

5 Volatility and Impact 

Equation (|20|l is useful for determining the agent's error when we know the 
volatility of the system. However, it is likely that this value is not available 
to us (if we knew it we would already know a lot about the dynamics of the 
system). In this section we determine the value of Vi in terms of the other 
agents' changes in their decision functions. That is, in terms of Pr[(5* +1 ^ 5fj], 
for all other agents j. 

In order to do this we first need to define the impact Iji that agent j's 



13 



changes in its decision function have on j's target function. 

y wew I Tl = Pr[A* +1 ( w ) ± Af(to) | <5*+V) ^ £»] (21) 

We can now start to define volatility by first determining that, for two agents 
i and j 

V weW vj = Pv[A* +1 (w) ^ Ai(w)] 

= Pr[A< +1 H ? A-(w) | * **(«,)] ■ Pr$+» ^ 

+ Pr[A*+V) ^ AJ(ib) | = ' Pr$ +1 H = <$*>)]• 

(22) 

The reader should notice that volatility is no longer constant; it varies 
with time (as recorded by the superscript). The first conditional probability 
in Eq. I)22|l is just Iji. The second one we will set to 0, since we are specifically 
interested in MASs where the volatility arises only as a side-effect of the other 
agents' learning. That is, we assume that agent i's target function changes only 
when j's decision function changes. For cases with more than two agents, we 
similarly assume that one agent's target function changes only when some other 
agent's decision function changes. That is, we ignore the possibility that outside 
influences might change an agent's target function. 

We can simplify Eq. I|22|) and generalize it to N agents, under the assumption 
that the other agents' changes in their decision functions will not cancel each 
other out, making A' stay the same as a consequence. v\ then becomes 

= i n c 1 - ^ pr $ +i H + s j( w )])- (23) 

jeN-i 

We now need to determine the expected value of Pr[5' +1 (w) ^ $j(w)] for 
any agent. Using i instead of j we have 

V wew Pr$+» ? 6t(w)} 

= Pr[6f(w) ± A*H] • Pr[6l +1 (w) ? A\(w) | S^w) * A<(u,)] (24) 

+ Pr$(«;) = Al(w)} ■ Pr[^ + » ? Af( W ) | **(«,) = A\(w)], 

where the expected value is: 

E[Pr[Sl +1 (w) ? 5$(w)]] = ae(Si) + (1 - n) ■ (1 - e{8\)). (25) 

We can then plug Eq. 125|) into Eq. (|23[1 in order to get the expected volatility 

EM} = 1- II l-^ife-e^ + Cl-^-a-e^))). (26) 

We can use this expected value of v\ in Eq. (|2U[) in order to find out how the 
other agents' learning will affect agent i. In MASs that have identical learning 
agents (i.e., their c, I, r, and I rates are all the same and they start with the 
same initial error) we can replace the multiplier in Eq. (|26|l with an exponent 
of \N\ — 1. We use this simplification later in Section 
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Figure 7: Plot of Final Error for agent i, given U = lj = .2, rj = Tj = 1, 
^ = ^• = 1,1^-1 = 1^1=20. 



6 An Example with Two Agents 

In a MAS with just two agents i and j, we can use Eq. I|26|> to rewrite Eq. I|20[l 
as 



We can now use Eq. I|27|) to plot values for one particular example. Let us 
say that U = lj = .2, q = Cj = 1, fj = r.,- = 1, A,- = \Ai\ = 20 and we let 
the impacts Jy and Iji vary between zero and one. Figure [7] shows the final 
error, after convergence, for this situation. It shows an area where the error is 
expected to be below .1, corresponding to low values for either Iij, Iji or both. 
This area represents MASs that are loosely coupled, i.e., one agent's change 
in behavior does not significantly affect the other's target function. In these 
systems we can expect that the error will eventually 1 reach a value close to zero. 
We see that as the impact increases the final error also increases, with a fairly 
abrupt transition between a final error of and bigger final errors. This abrupt 
transition is characteristic of these types of systems where there are tendencies 
for the system to either converge or diverge, and both of them are self-enforcing 
behaviors. Notice also that the graph is not symmetric — ly has more weight in 
determining i's final error than Iji. This result seems counterintuitive, until we 
realize that it is j's error that makes it hard for i to converge to a small error. 

1 Notice that we are not representing how long it takes for the error to converge. This can 
easily be done and is just one more of the parameters our theory allows us to explore. 




(27) 
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Figure 8: Vector plot for e(Sj) and e(<5j), where \Ai\ — \Aj\ = 20, k = lj = .2, 
r{ = rj = 1, Ci = .5, Cj = 1, Iij = .1, Iji — .3. It shows the error progression for 
a pair agents i and j. For each pair of errors (e((5|), e(6*)), the arrows indicate 
the expected (e(c5* +1 ), e(<5* +1 )). 



If Jy is high then, if i has a large error then j's error will increase, which will 
make j change its decision function often and make it hard for i to reduce its 
error. If 1^ is low then, even if Iji is high, j will probably settle down to a low 
error and as it does i will also be able to settle down to a low error. 

If we were about to design a MAS we would try to build it so that it lies 
in the area where the final error is zero. This way we can expect all agents to 
eventually have the correct behavior. We note that a substantial percentage of 
the research in DAI and MAS deals with taking systems that are not inherently 
in this area of near-zero error and designing protocols and rules of encounter so 
as to move them into this area, as in |25j. 

The fact that the final error is 1 for the case with iy = Iji ~ 1 can seem 
non-intuitive to readers familiar with game theory. In game theory there are 
many games, such as the "matching game" from Section 13. II where two agents 
have an impact of 1 on each other. However, it is known 3] that, in these 
games, two learning agents will eventually converge to one of the equilibria (if 
there are any), making their final error equal to 0. This is certainly true, and 
it is exactly what we showed in Section 13.11 The same result is not seen in 
Figure [7| because the figure was plotted using our simplified Equation. (|20|) . 
which makes the simplifying independence assumption given by Eq. i|18|) . This 
assumption cannot be made in games such as the matching game because, in 
these games, there is a correlation between the correctness of each of the agents 
actions. Specifically, in the matching game it is always true that both agents are 
either correct, or incorrect, but it is never true that one of them is correct while 
the other one is incorrect, i.e., either they matched, or they did not match. 

Another view of the system is given by Figure |H1 which shows a vector plot 
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Figure 9: Comparison of observed and predicted error. 



of the agents' errors. We can see how the bigger errors are quickly reduced but 
the pace of learning decreases as the errors get closer to the convergence point. 
Notice also that an agent's error need not change in a monotonic fashion. That 
is, an agent's error can get bigger for a while before it starts to get smaller. 



7 A Simple Application 

In order to demonstrate how our theory can be used, we tested it on a simple 
market-based MAS. The game consists of three agents, one buyer and two seller 
agents i and j. The buyer will always buy at the cheapest price — but the sellers 
do not know this fact. In each time step the sellers post a price and the buyer 
decides which of the sellers to buy from, namely, the one with the lowest bid. 
The sellers can bid any one of 20 prices in an effort to maximize their profits. 
The sellers use a reinforcement learning algorithm with their reinforcements 
being the profit the agent achieved in each round, or if it did not sell the good 
at the time. In this system we had one good being sold (|W| = l).As predicted 
by economic theory, the price in this system settles to the sellers' marginal cost, 
but it takes time to get there due to the learning inefficiencies. 

We experimented with different dj rates 2 for the reinforcement learning of 
agent j, while keeping aj = .1 fixed, and plotted the running average of the 
error of agent i. A comparison is shown in Figure [5] Figure 9(a) gives the 



experimental results for three different values of otj. It shows i's average error, 
over 100 runs, as a function of time. Since both sellers start with no knowledge, 
their initial actions are completely random which makes their error equal to .5. 

2 a is the relative weight the algorithm gives to the most recent payoff, a = 1 means that 
it will forget all previous experience and use only the latest payoff to determine what action 
to take. 
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Then, depending on ay, z's error will either start to go down from there or will 
first go up some and then down. Eventually, i's error gets very close to 0, as 
the system reaches a market equilibrium. 

We can predict this behavior using Eq. i|27|) . Based on the game description, 
we set | Ai| = \Aj\ = 20, since there were 20 possible actions. We let = 
Tj = 1 because reinforcement learning with fixed payoffs enforces the condition 
that once an agent is taking the correct action it will never change its decision 
function to take a different action. The agent might, however, still take a wrong 
action but only when its exploration rate dictates it. 

We then let = Iji — .17 based on the rough calculation that each agent 
has an equal probability of bidding any one of the 20 prices. If A* = 20 then 
Iji for this situation is the probability that j was also bidding 20 or above, 
i.e., 1/20, times the probability that j's new price is lower than 20, i.e. 19/20. 
Similarly, if A* = 19 then Iji is equal to 2/20 times 18/20. The average of all 
of these probabilities is .17. A more precise calculation of the impact would 
require us to find it via experimentation by actually running the system. 

Finally, we chose k — lj = c L = Cj — .005 for the first curve (i.e., the one that 
compares with ay = .1). We knew that for such a low ctj the learning and change 
rate should be the same. The actual value was chosen via experimentation. The 
resulting curve is shown in Figure 9(b) At this moment, we do not possess a 
formal way of deriving learning and change rates from a-rates. 

For the second curve (ay = .3) we knew that, since only ay had changed 
from the first experiment, we should only change lj and Cj. In fact, these 
two values should only be increased. We found their exact values, again by 
experimentation, to be lj = .04, Cj = .4. For the third curve we found the 
values to be lj = .055, Cj — .8. 

One difference we notice between the experimental and the theoretical results 
is that the experimental results show a longer delay before the error starts to 
decrease. We attribute this delay to the agent's initially high exploration rate. 
That is, the agents initially start by taking all random actions but progressively 
reduce this rate of exploration. As the exploration rate decreases the discrepancy 
between our theoretical predictions and experimental results is reduced. 

In summary, while it is true that we found lj and Cj by experimentation, 
all the other values were calculated from the description of the problem. Even 
the relative values of lj and Cj follow the intuitive relation with ay that, as 
ay increases so does lj and (even more) Cj . Section El shows how to calculate 
lower bounds on the learning rate. We believe that this experiment provides 
solid evidence that our theory can be used to approximately determine the 
quantitative behaviors of MASs with learning agents. 
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8 Application of our Theory to Experiments in 
the Literature 

In this section we show how we can apply our theory to experimental results 
found in the AI and MAS literature. While we will often not be able to com- 
pletely reproduce the authors' results exactly, we believe that being able to 
reproduce the flavor and the main quantitative characteristics of experimental 
results in the literature shows that our theory can be widely applied and used 
by practitioners in this area of research. 



8.1 Claus and Boutilier 



Claus and Boutilier 6] study the dynamics of a system that contains two re- 
inforcement learning agents. Their first experiment puts the two agents in a 
matching game exactly like the one we describe in Section 13.11 with \Ai\ = 
\Aj\ = 2. Their results show the probability that both agents matched (i.e., 1 

Since they were using two reinforcement learning 

was 



- e(<5*)) as time progressed. 

agents, it was not surprising that the curve they saw, seen in Figure 10(a) 



nearly identical to the curve we saw in our experiments with the two buying 
agents (Figure 9(a) with atj = an = .1, except upside-down). 

We can reproduce their curve using our equation for the matching game 
ifTfiJl. The results can be seen in Figure |10(b)| 



Eq 



Our theory again fails to 



account for the initial exploration rate. We can, however, confirm that by time 
15 their Boltzmann temperature (the authors used Boltzmann exploration) had 
been reduced from an initial value of 16 to 3.29 and would keep decreasing by 
a factor of .9 each time step. This means that by time 15 the agents were, 
indeed, starting to do more exploitation (i.e., reduce their error) while doing 
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(a) Original Experiment (b) Theory and Experiment 

Figure 11: Comparing theory (b) with results from [3(| (a). 



little exploration. 



8.2 Shoham and Tennenholtz 

Shohan and Tennenholtz |30| investigate how learning agents might arrive at 
social conventions. The authors introduce a simple learning algorithm (strategy- 
selection rule) called highest cumulative reward (HCR) which their agents use for 
learning these conventions. Shoham and Tennenholtz also provide the results of 
a series of experiments using populations of learning agents. We try to reproduce 
the results they present in their Section 4.1 where they study the "coordination 
game" which is similar to our matching game, but with only two actions. 

The experiment in question involves 100 agents, all of them identical and all 
of them using HCR. At each time instant the agents take one of two available 
actions. The aim is for every pair of chosen agents to take the same action 
as each other. Agents are randomly made to form pairs. The agents update 
their behavior (i.e., apply HCR) after a given delay. The authors try a series 
of delays (from to 200) and show that increasing the update delay decreases 
the percentage of trials where, after 1600 iterations, at least 95% of the agents 
reached a convention. The authors show surprise at finding this phenomenon. 



Their results are reproduced in Figure 11(a) (cf. Figure 1 in their article). 



The number of actions for all agents is easily set to \Ai\ =2, which implies that 
we must have k = Ci. By examining HCR, it is easy to determine that = 1 
(i.e if an agent took the right action, it will only get more support for it). At 
first intuition, one's impulse is to set Iij = 1 for every pair of agents i and j. 
However, since there are 100 of them and only pairs of them interact at every 
time instant, the real impact is Jjj = 1/99. 

We will now convert from their units of measurement into ours. In Fig- 
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ure 11(a) we can see that their x-axis is called the update delay, which we will 
refer to as d. This value is the number of time units that pass before the agent 
is allowed to learn. For d = the agent learns after every interaction (i.e., on 
every time t), while for d = 200 the agent takes the same action for 200 time in- 
stances and only learns after every 200 iterations. This means that we must set 
li — p (d + i) wnere P > 0. The value of p depends on their learning algorithm's 
performance, but we know that it must be a small number (< 50) greater than 
0. Through some experimentation we settled on p = 6 (other values close to 
this one give similar results). Since in their graph they look at < d < 200, we 
must then look at U where < li < g Finally, we find the value of d in terms 
of li to be 

d = — - 1 (28) 
Pk 



The y-axis of Figure 11(a) is the success, i.e., number of trials, out of 4000, 
where at least 95% of the agents reached a convention. We will refer to this 
value as s. We know that in s/4000 of the trials at least 95% of the agents have 
error close to (i.e., reaching a convention means that the agents take the right 
action almost all the time), and for the rest of the trials the error was greater. 
We can approximately map this to an error by saying that in s /4000 of the trials 
the error was (a slight underestimate), while in 1 — s/4000 of the trials the 
error was 1 (a slight overestimate). We add these two up (the makes the first 
term disappear) and arrive at an equation that maps s to e(Sf). 

e ^K^bob-J (29) 

The mapping from d to s is given by their actual data. Their data can be 
fit by the following function: 

S = 3900-4.-^1! (30 ) 

Plugging Eq. l(2*5|) into Eq. (|2UJ), and the result into Eq. we finally arrive 
at a function that maps their experimental results into our units: 



4000 



(3900-4(lM-l)-( ^7 V 100)2 )) 



Final error = — (31) 

4000 v ' 

for the range < Zj < |. 

Now that we have values for Ci, U, ri, Iij, \Ai\, a range for li and an equation 
that maps their experimental results into our units, we can plot both functions, 



as seen in Figure 11(b) The x-axis was plotted on a log-scale in order to better 
show the shape of the experiment curve, otherwise it would appear mostly as a 
straight line. For our theory curve we used Equations (|20|l and J2HJ), and iterated 
for 1600 time units, just like in the experiment, and plotted the error at that 
point. For the experiment curve we used Eq. (|31[1 . We plotted both of these 
curves in the specified range for ij. The reader will notice that our theory was 
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able to make precise quantitative predictions. The maximum distance from our 
theory curve to the experimental curve is .05, which means that our predictions 
for the final error were, at worst, within 5% of the experimental values. Also, an 
error of about 5% was introduced when mapping from their success percentage 
s to our error. 

8.3 Others 

There are several other examples in the literature where we believe our theory 
can be successfully applied. 0, chapter 3.7] gives results of an experiment 
where two agents try to find each other in a 100 by 100 grid. He shows that if 
the grid has few obstacles it is faster if both agents move towards each other, 
while if there are many obstacles it is faster if one of the agents stays still 
while the other one searches for it. We believe that the number of obstacles is 
proportional to the change rate that the agents experience and, perhaps, to the 
impact that they have on each other. When there are no obstacles the agents 
never change their decision functions (because their initial Manhattan heuristics 
lead them in the correct path). As the number of obstacles increases, the agents 
will start to change their decision functions as they move, which will have an 
impact on the other agent's target function. If, however, one of them stays put, 
this means that his change rate is so the other agent's target function will 
stay still and he will be able to reach his target (i.e., error 0) quicker. 

Notice that the problem of a moving target that Ishida studies is different 
from the problem of a moving target function which we study. It is, however, 
interesting to note their similarities and how our theory can be applied to some 
aspects of that domain. 

Another possible example is given by [2Jj . They show two Q-learning agents 
trying to cooperate in order to move a block. The authors show how different a 
rates (/? in their article) affect the quality of the result that the agents converge 
to. This quality roughly corresponds to our error, except for the fact that their 
measurements implicitly consider some actions to be better than others, while 
we consider an action to be either correct or incorrect. This discrepancy would 
make it harder to apply our theory to their results but we still believe that a 
rough approximation is possible. Our future work includes the extension of the 
CLRI framework to handle a more general definition of error — one that attaches 
a utility to each state-action pair, rather than the simple correct /incorrect cat- 
egorization we use. 

9 Bounding the Learning Rate with Sample Com- 
plexity 

In the previous examples we have used our knowledge of the learning algorithms 
to determine the values of the agent's Cj, U, and parameters. However, there 
might be cases where this is not possible — the learning algorithm might be too 
complicated or unknown. It would be useful, in these cases, to have some other 
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measure of the agent's learning abilities, which could be used to determine some 
bounds on the values of these parameters. 

One popular measure of the complexity of learning is given by Probably 
Approximately Correct (PAC) theory [Tg, in the form of a measure called the 
sample complexity. The sample complexity gives us a loose upper bound on 
the number of examples that a consistent learning agent must observe before 
arriving at a PAC hypothesis. 

There are two important assumptions made by PAC-theory. The first as- 
sumption is that the agents are consistent learners 3 . Using our notation, a 
consistent learner is one who, once it has learned a correct w —> a mapping does 
not forget it. This simply means that the agent must have r; = 1. The second 
assumption is that the agent is trying to learn a fixed concept. This assumption 
makes A* +1 = A- true for all t. 

The sample complexity m of an agent's learning problem is given by 

,„>I(l„ffl). (32) 

where \H\ is the size of the hypothesis space for the agent. In other words, \H\ 
is the total number of different 5i(w) functions that the agent will consider. For 
an agent with no previous knowledge we have \H\ = \Ai\' w >. However, agents 
with previous knowledge might have smaller \H\, since this knowledge might be 
used to eliminate impossible mappings. If a consistent learning agent has seen 
m examples then, with probability at least (1 — 7), it has error at most e. 

While we cannot map the sample complexity m to a particular learning rate 
l i: we can use it to put a lower bound on the learning rate for a consistent 
learning agent. That is, we can find a lower bound for the learning rate of an 
agent who does not forget anything it has seen, and who is trying to learn a 
fixed target function. Since the agent does not forget anything it has seen, we 
can deduce that its retention rate must be = 1. Since the target function 
is not changing, we know that Pr[A* +1 (» ^ Af(iu)] = and Pr[A* +1 (» = 
A* (it;)] = 1. We can plug these values into Eq. lfTi3|l and simplify in order to 

§et ' E[e(5t+ 1 )]=e(St)-(l-l i ). (33) 

We can solve the difference Eq. (|33|) . for any time n, in order to get: 

E[e(S?)] = e(6°) ■ (1 - k)". (34) 

We now remember that after m time steps we expect, with probability (1 — 7), 
the error to be less than e. Since Eq. (|3"4*|) only gives us an expected error, 
not a probability distribution over errors, we cannot use it to calculate the 
likelihood of the agent having that expected error. That is, we cannot calculate 
the "probably" (7) part of probably approximately correct. We will, therefore, 
assume that the 7 chosen for m is small enough so that it will be safe to say 
that, after m time steps, the error is less than e. In a typical application one 

3 See l2-ll pl62] for a formal definition of a consistent learner. 
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uses a small 7 because it guarantees a high degree of certainty on the upper 
bound of the error. 

Since we can now safely say that, after m time steps, the error is less than e, 
we can then deduce that the U for this agent should be small enough such that, 
if n = m, then E[e(S^)] < e. This is expressed mathematically as: 



This equation is not defined for e(Sf) = 0. However, given our assumption of 
a fixed target function and r, = 1, we already know, from Eq. 1|33[) . that if an 
agent starts with an error of it will maintain this error of for any future time 
t > 0. Therefore, in this case, the choice of a learning rate has no bearing on 
the agent's error, which will always be 0. 

Equation (|36[) gives us a lower bound on the learning rate that a consistent 
learner must have, given that it has sample complexity m, and based on an 
error e and a sufficiently small 7. A designer of an agent that uses a reason- 
able learning algorithm can expect that, if his agent has sample complexity m 
(for e error), then his agent will have a learning rate of at least h, as given by 
Eq. I|36|) . Furthermore, if a designer is comparing two possible agent designs, 
each with a different sample complexity but both with similarly powerful learn- 
ing algorithms, he can calculate bounds on the learning rates of both agents 
and compare their relative performance. 



The topic of agents learning about agents arises often in the studies of com- 
plexity 0. In fact, systems where the agents try to adapt to endogenously 
created dynamics are being widely studied 0, Q|. In these systems, like in 
ours, the agents co-create their expectations as they learn and change their be- 
haviors. Complexity research uses simulated agents in an effort to understand 
the complex behaviors of these systems as observed in the real world. 

One example is the work of Arthur et. al. 2], who arrive at the conclusion 
that systems of adaptive agents, where the agents are allowed to change the 
complexity of their learning algorithms, end up in one of two regimes: a sta- 
ble/simple regime where it is trivial to predict an agent's future behavior, and a 
complex regime where the agents' behaviors are very complex. It is this second 
regime that interests complexity researchers the most. In it, the agents are able 
to reach some kind of "equilibrium" point in model building complexity. These 
same results are echoed by Darley and Kauffman [8| in a similar experiment. In 
this article we have not allowed the agents to dynamically change the complex- 
ity of their learning algorithms. Therefore, our dynamics are simpler. Allowing 



e(<5°) • (1 - kT < e- 



(35) 



We solve this equation for U in order to get: 




(36) 



10 Related Work 
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the agents to change their complexity amounts to allowing them to change the 
values of their c, I, and r parameters while learning. 

However, while complexity research is very important and inspiring, it is only 
partially relevant to our work. Our emphasis is on finding ways to predict the 
behavior of MASs composed of machine-learning agents. We are only concerned 
with the behavior of simpler artificial programmable agents, rather than the 
complex behavior of humans or the unpredictable behavior of animals. 

The dynamics of MASs have also been studied by Kephart et. al. . In 
this work the authors show how simple predictive agents can lead to globally 
cyclic or chaotic behaviors. As the authors explain, the chaotic behaviors were a 
result of the simple predictive strategies used by the agents. Unlike our agents, 
most of their agents are not engaged in learning, instead they use simple fixed 
predictive strategies, such as "if the state of the world was x ten time units 
before, then it will be x next time so take action a" . The authors later show 
how learnin g ca n be used to eliminate these chaotic global fluctuations. 

Mataric [22j has studied reinforcement learning in multi-robot domains. She 
notes, for example, how learning can give rise to social behaviors |23|. The 
work shows how robots can be individually programmed to produce certain 
group behaviors. It represents a good example of the usefulness and flexibility 
of learning agents in multi-agent domains. However, the author does not offer 
a mathematical justification for the chosen individual learning algorithms, nor 
does she explain why the agents were able to converge to the global behaviors. 
Our research hopes to provide the first steps in this direction. 

One particularly interesting approach is taken by Carmel and Markovitch 
. They work on model-based learning, that is, agents build models of other 
agents via observations. They use models based on finite state machines. The 
authors show how some of these models can be effectively learned via observation 
of the other agent's actions. The authors concentrate on the development of 
learning algorithms that would let one agent learn a finite-state machine model 
of another agent. They have not considered the case where two or more agents 
are simultaneously learning about each other, which we study in this article. 
However, their work is more general in the sense that they model agents as 
state machines, rather than the state-action pairs we use. 

Finally, a lot of experimental work has been done in the area of agents 
learning about agents 0,01. For example, Sen and Sekaran show how 
learning agents in simple MAS converge to system- wide optimal behavior. Their 
agents use Q-learning or modified classifier systems in order to learn. The 
authors implement these agents and compare the performance of the different 
learning algorithms for developing agent coordination. Hu and Wellman [13 . IT^ 
have studied reinforcement learning in market-base MASs, showing how certain 
initial learning biases can be self-fulfilling, and how learning can be useful but 
is affected by an agent's models of other agents. Claus and Boutilier [|J have 
also carried out experimental studies of the behavior of reinforcement learning 
agents. We have been able to use the CLRI framework to predict some of their 
experimental results j3^| . Other researchers [13, HH El have extended the basic 
Q-learning |35j | algorithm for use with MASs in an effort to either improve or 
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prove convergence to the optimal behavior. 

We have also successfully experimented with reinforcement learning simula- 
tions [3^1 , but we believe that the formal treatment elucidated in these pages 
will shed more light into the real nature of the problem and the relative im- 
portance of the various parameters that describe the capabilities of an agent's 
learning algorithm. 

11 Limitations and Future Work 

The CLRI framework places some constraints on the type of systems it can 
model, which limits its usability. However, it is important to understand that, 
as we remove the limitations from the CLRI framework, the dynamics of the 
system become much harder to predict. In the extreme, without any limitations 
on the agents' abilities, the system becomes a complex adaptive system, as 
studied by Holland 01 an d others in the field of complexity. The dynamic 
behavior of these systems continues to be studied by complexity researchers 
with only modest progress. It is only by placing limitations on the system that 
we were able to predict the expected error of agents in the systems modeled by 
the CLRI framework. 

Our ongoing work involves the relaxation of some of the constraints made by 
the CLRI framework so that it may become more easily and widely applicable, 
without making the system dynamics impossible to analyze. We are targeting 
three specific constraints. 

1. The values of Cj, ij, rj, and Jjj cannot, in all situations, be mathematically 
determined from the system's description. We have found that bounds for 
the Ci, li, and r<j values can often be determined when using reinforcement 
learning or supervised learning. However, the bounds are often very loose. 
The values of the Iij parameter depend on the particular system. Some- 
times it is trivial to calculate the impact, sometimes it requires extensive 
simulation. 

2. The CLRI framework assumes that an agent's action is either correct or 
incorrect. The framework does not allow degrees of correctness. Specifi- 
cally, in many systems the agents can often take several actions, any one 
of which is equally good. When modeling these systems, the CLRI frame- 
work requires the user to designate one of those actions as the correct one, 
thereby ignoring some possibly useful information. 

3. The world states are taken from a uniform probability distribution which 
does not change over time. The environment is assumed to be episodic. 
As such, the framework is limited in the type of domains it can effectively 
describe. 

We are attacking these challenges with some of the same tools used by re- 
searchers in complex adaptive systems, namely, agent-based simulations and 
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co-evolving utility landscapes. We believe we can gain some insight into the dy- 
namics of adaptive MASs by constructing and analyzing various types of MASs. 
We also believe that the next step for the CLRI framework is the replacement of 
the current error definition with a utility function. The agents can then be seen 
as searching for the maximum value in the changing utility landscape defined 
by their utility function. The degree to which the agents are successful on their 
climb to the landscape peaks depends on the abilities of their learning algo- 
rithm (change rate, learning rate, and retention rate), and the speed at which 
the landscape changes as the other agents change their behavior (impact). 

The use of utility landscapes will allow us to consider an agent's utility 
for any particular action, rather than simply considering whether an action is 
correct or incorrect. The landscapes will also allow us to consider systems where 
agents cannot travel between any two world states in one time step. That is, the 
agents' moves on the landscape will be constrained in the same manner as their 
actions or behaviors are constrained the actual system. Finally, the new theory 
will likely need to redefine the CLRI parameters. We hope the new parameters 
will be easy to derive directly from the values that govern the machine-learning 
algorithms' behavior. These extensions will make the new theory applicable to 
a much wider set of domains. 

12 Summary 

We have presented a framework for studying and predicting the behavior of 
MASs composed of learning agents. We believe that this framework captures the 
most important parameters that describe an agents' learning and the system's 
rules of encounter. Various comparisons between the framework's predictions 
and experimental results were given. These comparisons showed that the theo- 
retical predictions closely match our experimental results and the experimental 
results published by others. Our success in reproducing these results allows us 
to confidently state the effectiveness and accuracy of our theory in predicting 
the expected error of machine learning agents in MASs. 

Since our theory describes an agent's behavior at a high-level (i.e., the agent's 
error), it is not capable of making system-specific predictions (e.g., predicting 
the particular actions that are favored). These types of system-specific predic- 
tions can only be arrived at by the traditional method of implementing popula- 
tions of such agents and testing their behaviors. However, we expect that there 
will be times when the predictions from our theory will be enough to answer a 
designer's questions. A MAS designer that only needs to determine how "good" 
the agent's behavior will be could probably use the CLRI framework. A de- 
signer that needs to know which particular emergent behaviors will be favored 
by his agents will need to implement the agents. 

Finally, while we have given some examples as to how learning rates can be 
determined for particular machine learning implementations, we do not have 
any general method for determining these rates. However, we showed how to 
use the sample complexity of a learning problem to determine a lower bound on 
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the learning rate of a consistent learning agent. This bound is useful for quickly 
ruling out the possibility of having agents with high expected errors and of 
stating that an agent's expected error will be, at most, a certain constant value. 
Still, if the agent's learning algorithm is much better than the one assumed by 
a consistent learner (e.g., the agent is very good at generalizing from one world 
state to many others), then these lower bounds could be significantly inaccurate. 



A Derivation for Matching Game 

If we can assume that the action chosen when an agent changes 8\ (w) and the 
result does not match A*(u>) (for some specific w) is taken from a flat probability 
distribution, then we can say that: 

We will now show how to calculate the fourth term in (|16[) . For the matching 
game we find that we can set: 

D = 1 — (38) 
F = l j + (l-l j )( l M Z ^). 09) 



Having \Ai\ = 2 implies that Cj = li, this means that for this case we have 

(i - Ci )D + k + (a - k)F = h + (i- a)(i - (40) 

For the case where \Ai\ > 2, which is the case we are interested in, we can 
plug in the values for D and F and simplify, in order to get the fourth term: 



Cilj(\Ai\ + - Cl 

\Ai 



(i - a)D + h + {a - k)F = i-i j + ^ n1 ^' ,^7T lJ > (41) 
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